[CI] Build the test image once and pull it from ECR by hujc7 · Pull Request #5905 · isaac-sim/IsaacLab

hujc7 · 2026-06-01T22:47:42Z

1. Summary

main's build.yml rebuilds the test Docker image independently in both test-isaaclab-tasks and test-general (~23 min each, in parallel, racing the shared GHA layer cache). This builds the same image twice every run.
This change builds the image once in a dedicated build job that pushes it to ECR, then has the two test jobs depend on it and pull the prebuilt image instead of rebuilding.
Adopts the ecr-build-push-pull action already used by develop/release on the same self-hosted runner pool; it resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so no runner regresses.

2. Changes

New .github/actions/ecr-build-push-pull/action.yml — generic drop-in for docker-build with ECR-backed caching (ported verbatim from develop). Builds + pushes the commit-tagged image, or pulls it if it already exists in ECR.
.github/workflows/build.yml:
- Added a build job that builds and pushes the image to ECR once.
- test-isaaclab-tasks and test-general now needs: [build] and replace their per-job Build Docker Image step with an ECR pull of the same image-tag.
- run-tests, result copy/upload, fork-result checks, and combine-results are unchanged. Job graph: build → {test-isaaclab-tasks, test-general} → combine-results.

3. Adapted to main (not a develop bulk-copy)

Keeps main's two-job test structure, env vars, run-tests/combine-results, and the existing image (docker/Dockerfile.curobo) — this only changes how many times the image is built, not what is built.
Uses cache-tag: cache-main-curobo so main's layer cache does not collide with develop/beta2's namespace in the shared ECR repo.

4. Expected effect

The duplicate ~23-min build collapses to one build + a ~1–2 min pull per test job, and the two jobs no longer race the GHA cache.

5. Test plan

build.yml and the new action validate as YAML; job dependency graph is build → test jobs → combine-results.
Pre-commit clean.
CI run (this PR): the build job pushes once; test-isaaclab-tasks/test-general pull the prebuilt image (no rebuild). Draft until that is observed green.

Note: targets main deliberately — develop already uses this build-once/ECR pattern; this brings main's build.yml in line.

build.yml rebuilt the test image independently in both test-isaaclab-tasks and test-general (~23 min each, run in parallel and racing the GHA layer cache). Add a single build job that builds the image and pushes it to ECR via the ecr-build-push-pull action (ported from develop), and have the test jobs depend on it and pull the prebuilt image instead of rebuilding. The action resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so runners without ECR keep working as before. Preserves main's image (docker/Dockerfile.curobo), test execution, and result handling.

isaaclab-review-bot

🤖 Isaac Lab Review Bot — PR #5905

Summary

This PR eliminates a redundant ~23-min Docker image build by introducing a dedicated build job that pushes to ECR once, then has both test-isaaclab-tasks and test-general pull the prebuilt image. A new composite action (.github/actions/ecr-build-push-pull/action.yml) encapsulates the build/push/pull logic with a multi-level caching strategy (exact commit tag → deps hash → full build).

🧪 Isaac Lab Expert

Architecture: ✅ Correct build-once/pull pattern

The job dependency graph (build → {test-isaaclab-tasks, test-general} → combine-results) ensures the image is always available before test jobs start.
The cache-tag: cache-main-curobo correctly namespaces main's layer cache away from develop/beta2.
The action is ported from the already-proven develop workflow — good reuse.

Race condition risk: ✅ None identified

The needs: [build] dependency in both test jobs provides a hard barrier — no test job starts until the build job succeeds.
The previous approach had both test jobs building in parallel racing the GHA layer cache; this eliminates that race entirely.

Runner affinity consideration: ⚠️ Minor

The build job and test jobs all use [self-hosted, gpu] runners. If the test jobs land on a different runner than the build job, they will pull from ECR (not local cache). This is by design and the action handles it correctly via docker manifest inspect → pull → tag. However, if the ECR push in the build job fails silently (see below), the test jobs would fall back to a full local build — which is actually the correct degraded behavior.

🔇 Silent Failure Hunter

1. ECR push failure in build job (Step 7) — Medium risk

If docker push fails in the build job's "Push to ECR" step, it will cause the build job to fail (non-zero exit), which would block test jobs entirely due to needs: [build].
✅ This is correct behavior — a hard failure is better than a silent one.

2. ECR unavailability in test jobs — ✅ Safe

When test jobs invoke the action and ECR is unavailable, the action resolves no ECR URL and falls back to a full local build (Step 6). The || true guards on IMDS/SSM calls prevent hard failures during resolution.
However, since the build job already pushed the image, the test jobs should always find it via the commit-tagged ECR image (Step 4 manifest check → pull).

3. Docker credential store bypass — ✅ Properly handled

Step 1 creates a temp config with credsStore: "" and cleans it up in Step 9 (if: always()).

4. Potential gap: docker buildx imagetools create failure in deps-cache step (Step 5)

If docker buildx imagetools create -t "${ECR_IMAGE}" "${DEPS_ECR_IMAGE}" fails, the step would exit non-zero and the build job would fail. This is safe (fails loud).

5. || true patterns in resolution steps — Acceptable

The || true guards on IMDS token, instance ID, region, and SSM are intentional — they allow graceful degradation on non-EC2 runners.

📊 Test Coverage Analyzer

No new tests needed. This is a CI-infrastructure-only change. The existing CI run of this PR serves as the integration test (build → pull → test pattern).
The test plan in the PR description is appropriate: validate YAML, observe the green CI run.

🏁 Verdict: LGTM ✅

Well-structured CI optimization that follows the proven pattern from develop. The build-once/pull architecture is correct, race conditions are eliminated by job dependencies, and silent failures are properly handled with graceful degradation.

Minor suggestions (non-blocking):

Consider adding a brief comment in build.yml noting that ecr-build-push-pull in test jobs will effectively just pull (since build already pushed), to aid future maintainers reading the workflow.
The timeout-minutes: 180 on the build job matches the test jobs — this is generous for a single build. A tighter timeout (e.g., 60 min) could catch hangs earlier, though this is cosmetic given the existing test job timeouts.

Reviewed at commit: 7c22ca8

Update (c3c195c): New push is a merge from main into the feature branch (nightly-changelog workflow improvements, daqp bump to 0.8.5, version bump to 0.54.4). The PR's own files (.github/actions/ecr-build-push-pull/action.yml and .github/workflows/build.yml) are unchanged. Previous review still applies — LGTM holds.

Re-validated at: c3c195c | 2026-06-02T18:14 UTC

Update (49388a8): Two targeted improvements to the composite action:

🔒 Security fix (NGC credential leak): Changed docker login -p $NGC_API_KEY to pipe the key via --password-stdin, preventing the secret from appearing in /proc/<pid>/cmdline on shared self-hosted runners. Good catch — this was a real exposure vector on multi-tenant machines.
📋 Observability improvement (cache fallback logging): Added explicit log messages when ECR cache is available vs unavailable. The else branch now logs 🟠 ECR unavailable: full local build with no shared cache (build-once degrades to build-per-job) — making degradation immediately visible in CI output rather than requiring log archaeology.

Both changes are non-functional (no logic change), improve security posture and debuggability. LGTM still holds. ✅

Re-validated at: 49388a8 | 2026-06-05T21:28 UTC

greptile-apps · 2026-06-05T20:41:05Z

Greptile Summary

This PR eliminates the duplicate ~23-minute Docker image build that previously ran independently in both test-isaaclab-tasks and test-general by introducing a dedicated build job that builds and pushes the image to ECR once, then having both test jobs pull it.

A new composite action ecr-build-push-pull implements a three-tier cache strategy: (1) exact per-commit ECR tag, (2) a content-hash-based deps tag, and (3) full docker buildx build with ECR layer cache — with a local-only fallback when ECR is unreachable.
build.yml gains a build job at the head of the graph (build → {test-isaaclab-tasks, test-general} → combine-results); the per-job docker-build steps are replaced by calls to the new action that will hit the ECR pull path after the build job completes.
The cache-tag: cache-main-curobo namespace keeps main's ECR layer cache isolated from develop/beta2.

Confidence Score: 4/5

Safe to merge — the workflow change is well-structured and the fallback path is sound; the only noted issue is a minor docker credential hygiene concern.

The build job correctly gates both test jobs via needs, so test jobs never run against a missing image. The three-tier ECR caching logic handles all cache states correctly, and the fallback to local-only build when ECR is unavailable preserves the pre-PR behavior. The one finding worth fixing is the NGC API key being passed as a -p flag to docker login rather than via --password-stdin, which briefly exposes it in the process table on the self-hosted runner.

.github/actions/ecr-build-push-pull/action.yml — docker login credential handling on line 66; otherwise the workflow changes in build.yml are straightforward.

Important Files Changed

Filename	Overview
.github/actions/ecr-build-push-pull/action.yml	New composite action implementing a multi-tier ECR caching strategy (exact commit tag → deps hash → full build); minor: docker login uses -p flag instead of --password-stdin
.github/workflows/build.yml	Adds a dedicated `build` job that pushes to ECR once; test jobs now `needs: [build]` and replace docker-build with ecr-build-push-pull to pull the prebuilt image — job graph and environment variables are unchanged

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build job] --> B{resolve-ecr}
    B -- ECR available --> C{exact commit tag in ECR?}
    C -- hit --> D[Pull exact image from ECR]
    C -- miss --> E{deps hash in ECR?}
    E -- hit --> F[imagetools create alias commit tag]
    E -- miss --> G[Full docker buildx build]
    G --> H[Push commit tag to ECR]
    H --> I[Push deps tag to ECR]
    B -- ECR unavailable --> G2[Full local build, no push]
    D --> J[build job complete]
    F --> J
    H --> J
    G2 --> J
    J --> K[test-isaaclab-tasks]
    J --> L[test-general]
    K --> K1[ecr-build-push-pull: pulls or rebuilds]
    L --> L1[ecr-build-push-pull: pulls or rebuilds]
    K1 --> M[run-tests]
    L1 --> N[run-tests]
    M --> O[combine-results]
    N --> O

_{Reviews (1): Last reviewed commit: "Merge branch 'main' into jichuanh/main-c..." | Re-trigger Greptile}

greptile-apps · 2026-06-05T20:41:08Z

+
+        if [ -n "${{ env.NGC_API_KEY }}" ]; then
+          echo "🔵 Logging into nvcr.io..."
+          docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io


Passing the NGC API key as a -p flag exposes the secret value in the process argument list (/proc/<pid>/cmdline) for the duration of the docker login call. Any process on the same self-hosted runner with read access to /proc can capture it. Piping via --password-stdin avoids this entirely.

Suggested change

docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io

echo "${{ env.NGC_API_KEY }}" | docker login -u \$oauthtoken --password-stdin nvcr.io

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-05T20:41:09Z

+    - name: Full build
+      if: steps.pull-exact.outputs.hit != 'true' && steps.deps-cache.outputs.deps-cache-hit != 'true'


Full build runs in test jobs when ECR unavailable

When resolve-ecr resolves no ECR URL, both pull-exact and deps-cache are skipped (their if conditions gate on resolve-ecr.outputs.available == 'true'). Because neither step ran, their outputs remain unset (not 'true'), so the Full build condition passes and each test job performs an independent full build from scratch — which is the correct fallback, but means the build-once goal silently degrades to build-three-times when ECR is unavailable. The PR description mentions this explicitly, so it is intentional; consider adding a log line in the action to make this fallback observable in the CI log.

Pass the NGC API key to docker login via --password-stdin instead of -p so it never appears in the process argument list on the shared self-hosted runner. Log a clear line in the full-build step when ECR is unavailable, so the degradation from build-once to build-per-job is visible in CI logs.

github-actions Bot added the infrastructure label Jun 1, 2026

isaaclab-review-bot Bot reviewed Jun 1, 2026

View reviewed changes

Merge branch 'main' into jichuanh/main-ci-ecr-build-once

c3c195c

hujc7 marked this pull request as ready for review June 5, 2026 20:38

hujc7 requested review from hhansen-bdai and kellyguo11 as code owners June 5, 2026 20:38

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Build the test image once and pull it from ECR#5905

[CI] Build the test image once and pull it from ECR#5905
hujc7 wants to merge 3 commits into
isaac-sim:mainfrom
hujc7:jichuanh/main-ci-ecr-build-once

hujc7 commented Jun 1, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

greptile-apps Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io
	echo "${{ env.NGC_API_KEY }}" \| docker login -u \$oauthtoken --password-stdin nvcr.io

		- name: Full build
		if: steps.pull-exact.outputs.hit != 'true' && steps.deps-cache.outputs.deps-cache-hit != 'true'

Conversation

hujc7 commented Jun 1, 2026

1. Summary

2. Changes

3. Adapted to main (not a develop bulk-copy)

4. Expected effect

5. Test plan

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot — PR #5905

Summary

🧪 Isaac Lab Expert

🔇 Silent Failure Hunter

📊 Test Coverage Analyzer

🏁 Verdict: LGTM ✅

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaaclab-review-bot Bot left a comment •

edited

Loading